group_by() & summarise()

group_by() & summarise()
pivot_longer()
pivot_wider()
tidy
Author

Sean Conway

Published

December 24, 2023

knitr::opts_chunk$set(echo = T)
library(here)
library(tidyverse)
library(readxl)

Overview

Today we’re going to talk about tidy data, and how to use pivot_longer() and pivot_wider(). It may not be clear exactly why we’re doing things this way, but it will become explicitly clear when we begin data visualization.

tidy data

According to Wickham & Grolemund, there are three principles for tidy data.

  • Each variable must have its own column.

  • Each observation must have its own row.

  • Each value must have its own cell.

This is often easier said than done. However, we’re going to get a little bit of practice tidying data using the tidyr functions pivot_longer() and pivot_wider().

Read in data

livestock <- here("posts","_data","FAOSTAT_livestock.csv") %>%
  read_csv()
Rows: 82116 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Domain Code, Domain, Area, Element, Item, Unit, Flag, Flag Description
dbl (6): Area Code, Element Code, Item Code, Year Code, Year, Value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Pivoting

We’re going to practice with the FAOSTAT_livestock.csv dataset.

livestock

First, we’re going to remove the Code columns. We can get the same information out of the corresponding “non-code” columns, and this cleans things up for us. I will also remove the Flag Column.

Finally, if we use the unique() function to identify all the unique values in a column, we see that every value of Unit is “Head”, every value of Domain is “Live Animals”, and every value of Element is “Stocks”. If every value is the same, we can remove these columns.

unique(livestock$Unit)
[1] "Head"
unique(livestock$Domain)
[1] "Live Animals"
unique(livestock$Element)
[1] "Stocks"
livestock_1 <- livestock %>%
  select(c(-contains("Code"),
           -contains("Flag"),
           -Unit,
           -Domain,
           -Element)) # use the select() function to choose the columns I'm keeping, contains() to pick the columns containing the string "Code", and the - operator to tell select() that I want all the columns that DO NOT contain "Code" in their titles
# also use - operator to remove Flag & Flag Description
# also use - operator to remove Head
# also use - operator to remove Domain
livestock_1

Here, we notice that there are observations scattered across multiple rows. For example, if we sort by Year, we can see that each the Item column contains the livestock being measured, and Value contains the amount of livestock. This should probably be spread across columns - i.e,. one column for Camel counts, another for Goat counts, etc.

livestock_1 %>%
  arrange(Year)

That is, the data are too long. We can fix this by using pivot_wider() to take these observations and spread them across columns.

livestock_2 <- livestock_1 %>%
  pivot_wider(names_from = Item,
              values_from = Value)
livestock_2

Above, I used pivot_wider() to create columns for each unique type of livestock, where the rows contain the amounts of each livestock in a country during a specific year.

I can use pivot_longer() to get it back to the previous format. Note that I’m always creating a data frame /tibble with a new name rather than modifying an existing one.

livestock_3 <- livestock_2 %>%
  pivot_longer(c(Asses,Camels,Cattle,Goats,Horses,Mules,Sheep,Buffaloes,Pigs), # oh my!
  names_to = "Item", # the column containing these livestock names
  values_to = "Value")# the column containing these livestock counts # drop NA values
livestock_3               

There we are! However, note that if we compare livestock_3 and livestock_1, the former has a lot more rows than the latter. This is because livestock_1 had both explicit and implicit missing data. That is, some items were marked as NA, and others were not listed. When we pivoted wider, these implicit values became explicit, and this explicitness remained when we pivoted longer again.

Conclusion

Now you know a little bit about pivoting & tidy data. It takes a lot of practice, and we’ll work on it more this semester.